NSF PAR Search | NSF Public Access Repository


APOLLO: SGD-like Memory, AdamW-level Performance

Zhu, Hanqing; Zhang, Zhenyu; Cong, Wenyan; Liu, Xi; Park, Sem; Chandra, Vikas; Long, Bo; Pan, David Z; Wang, Zhangyang; Lee, Jinwon (May 2025, Conference on Machine Learning and Systems (MLSys))

Free, publicly-accessible full text available May 10, 2026
APOLLO: SGD-like Memory, AdamW-level Performance

Zhu, Hanqing; Zhang, Zhenyu; Cong, Wenyan; Liu, Xi; Park, Sem; Chandra, Vikas; Long, Bo; Pan, David Z; Wang, Zhangyang; Lee, Jinwon (February 2025, MLSys 2025)

Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
Free, publicly-accessible full text available February 17, 2026
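
The APOLLO abstract describes approximating AdamW's learning rate scaling with a low-rank auxiliary optimizer state built from a pure random projection. The NumPy sketch below is one possible reading of that idea, not the authors' implementation. The function name `apollo_like_step`, the per-column norm-ratio scaling, and all hyperparameter defaults are assumptions made for illustration only.

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-3, rank=4,
                     betas=(0.9, 0.999), eps=1e-8, seed=0):
    """One illustrative update step (hypothetical helper, not the paper's code).

    W, G  : (m, n) weight and gradient matrices.
    state : dict holding the projection and the low-rank Adam-style moments.
    """
    m, n = G.shape
    if "P" not in state:
        rng = np.random.default_rng(seed)
        # Fixed random projection, no SVD involved (assumed Gaussian here).
        state["P"] = rng.standard_normal((rank, m)) / np.sqrt(rank)
        state["M"] = np.zeros((rank, n))  # first moment, rank x n
        state["V"] = np.zeros((rank, n))  # second moment, rank x n
        state["t"] = 0
    state["t"] += 1
    b1, b2 = betas

    # Project the gradient into the low-rank space.
    R = state["P"] @ G
    # Adam-style moment updates, kept only in the low-rank space.
    state["M"] = b1 * state["M"] + (1 - b1) * R
    state["V"] = b2 * state["V"] + (1 - b2) * R ** 2
    m_hat = state["M"] / (1 - b1 ** state["t"])
    v_hat = state["V"] / (1 - b2 ** state["t"])
    R_adapted = m_hat / (np.sqrt(v_hat) + eps)

    # Assumed structured scaling: one factor per column, taken as the ratio
    # of the adapted to the raw projected gradient norms.
    scale = np.linalg.norm(R_adapted, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # Apply the coarse scaling to the full-rank gradient.
    return W - lr * G * scale
```

In this sketch the optimizer state holds two `rank x n` moment matrices plus the `rank x m` projection, rather than AdamW's two `m x n` moment matrices, which is where the memory saving would come from when `rank` is much smaller than `m`.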
ScaleNAS: Multi-Path One-Shot NAS for Scale-Aware High-Resolution Representation

Cheng, Hsin-Pai; Liang, Feng; Li, Meng; Cheng, Bowen; Yan, Feng; Li, Hai; Chandra, Vikas; Chen, Yiran (July 2023, The AutoML Conference 2022)

Full Text Available
Contrastive quant: quantization makes stronger contrastive learning

https://doi.org/10.1145/3489517.3530419

Fu, Yonggan; Yu, Qixuan; Li, Meng; Ouyang, Xu; Chandra, Vikas; Lin, Yingyan (July 2022, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference)

Contrastive learning learns visual representations by enforcing feature consistency under different augmented views. In this work, we explore contrastive learning from a new perspective. Interestingly, we find that quantization, when properly engineered, can enhance the effectiveness of contrastive learning. To this end, we propose a novel contrastive learning framework, dubbed Contrastive Quant, to encourage feature consistency under both differently augmented inputs via various data transformations and differently augmented weights/activations via various quantization levels. Extensive experiments, built on top of two state-of-the-art contrastive learning methods SimCLR and BYOL, show that Contrastive Quant consistently improves the learned visual representation.
Full Text Available
DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks

Fu, Yonggan; Yang, Haichuan; Yuan, Jiayi; Li, Meng; Wan, Cheng; Krishnamoorthi, Raghuraman; Chandra, Vikas; Lin, Yingyan (June 2022, Thirty-ninth International Conference on Machine Learning (ICML 2022))

Full Text Available
ScaleNAS: Multi-Path One-Shot NAS for Scale-Aware High-Resolution Representation

Cheng, Hsin-Pai; Liang, Feng; Li, Meng; Cheng, Bowen; Yan, Feng; Li, Hai; Chandra, Vikas; Chen, Yiran (January 2022, Proceedings of the AutoML Conference 2022 (co-located with ICML 2022) (AutoML 2022))

Full Text Available
DIAN: Differentiable Accelerator-Network Co-Search Towards Maximal DNN Efficiency

https://doi.org/10.1109/ISLPED52811.2021.9502478

Zhang, Yongan; Fu, Yonggan; Jiang, Weiwen; Li, Chaojian; You, Haoran; Li, Meng; Chandra, Vikas; Lin, Yingyan (July 2021, 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED))

Full Text Available
KeepAugment: A Simple Information-Preserving Data Augmentation Approach

Gong, Chengyue; Wang, Dilin; Li, Meng; Chandra, Vikas; Liu, Qiang (January 2021, Conference on Computer Vision and Pattern Recognition (CVPR))

Data augmentation (DA) is an essential technique for training state-of-the-art deep learning systems. In this paper, we empirically show that the standard data augmentation methods may introduce distribution shift and consequently hurt the performance on unaugmented data during inference. To alleviate this issue, we propose a simple yet effective approach, dubbed KeepAugment, to increase the fidelity of augmented images. The idea is to use the saliency map to detect important regions on the original images and preserve these informative regions during augmentation. This information-preserving strategy allows us to generate more faithful training examples. Empirically, we demonstrate that our method significantly improves upon a number of prior art data augmentation schemes, e.g. AutoAugment, Cutout, random erasing, achieving promising results on image classification, semi-supervised image classification, multi-view multi-camera tracking and object detection.
Full Text Available
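
The KeepAugment abstract describes using a saliency map to find important regions and preserving them during augmentation. A minimal sketch of that idea, applied to a Cutout-style augmentation, might look as follows. The function name `keep_cutout`, the brute-force window search, and the precomputed `saliency` array are assumptions for illustration; the abstract does not specify how the saliency map is computed or how regions are selected.

```python
import numpy as np

def keep_cutout(image, saliency, cut=8, keep=8, rng=None):
    """Cutout that restores the most salient window (hypothetical helper).

    image    : (H, W) or (H, W, C) array.
    saliency : (H, W) array of per-pixel importance scores (assumed given).
    Returns the augmented image and the top-left corner of the kept window.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = saliency.shape

    # Brute-force search for the keep x keep window with the highest saliency.
    best, by, bx = -np.inf, 0, 0
    for y in range(h - keep + 1):
        for x in range(w - keep + 1):
            s = saliency[y:y + keep, x:x + keep].sum()
            if s > best:
                best, by, bx = s, y, x

    out = image.copy()
    # Standard Cutout: zero a randomly placed square.
    cy = rng.integers(0, h - cut + 1)
    cx = rng.integers(0, w - cut + 1)
    out[cy:cy + cut, cx:cx + cut] = 0
    # Paste the salient window back so the informative region survives.
    out[by:by + keep, bx:bx + keep] = image[by:by + keep, bx:bx + keep]
    return out, (by, bx)
```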
Heterogeneous Dataflow Accelerators for Multi-DNN Workloads

https://doi.org/10.1109/HPCA51647.2021.00016

Kwon, Hyoukjun; Lai, Liangzhen; Pellauer, Michael; Krishna, Tushar; Chen, Yu-Hsin; Chandra, Vikas (February 2021, 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
One Weight Bitwidth to Rule Them All

Chin, Ting-Wu; Chuang, Pierce; Chandra, Vikas; Marculescu, Diana (August 2020, European Conference on Computer Vision Workshops)

Weight quantization for deep ConvNets has shown promising results for applications such as image classification and semantic segmentation and is especially important for applications where memory storage is limited. However, when aiming for quantization without accuracy degradation, different tasks may end up with different bitwidths. This creates complexity for software and hardware support and the complexity accumulates when one considers mixed-precision quantization, in which case each layer’s weights use a different bitwidth. Our key insight is that optimizing for the least bitwidth subject to no accuracy degradation is not necessarily an optimal strategy. This is because one cannot decide optimality between two bitwidths if one has smaller model size while the other has better accuracy. In this work, we take the first step to understand if some weight bitwidth is better than others by aligning all to the same model size using a width-multiplier. Under this setting, somewhat surprisingly, we show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization targeting zero accuracy degradation when both have the same model size. In particular, our results suggest that when the number of channels becomes a target hyperparameter, a single weight bitwidth throughout the network shows superior results for model compression.
Full Text Available
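
The abstract above compares bitwidths by aligning networks to the same model size with a width multiplier. The arithmetic behind such an alignment can be sketched as below, under the simplifying assumption that parameter count grows with the square of the width multiplier (roughly true when both input and output channels of a convolution scale together). Both function names are hypothetical and the paper's exact procedure may differ.

```python
import math

def matched_width_multiplier(ref_bits, bits):
    """Width multiplier that keeps model size equal to a reference bitwidth.

    Assumes size = params * bits and params proportional to width**2.
    """
    return math.sqrt(ref_bits / bits)

def model_size_bits(num_params, bits, width_mult=1.0):
    """Approximate model size in bits under the same quadratic assumption."""
    return num_params * width_mult ** 2 * bits
```

For example, under these assumptions a 2-bit network would be widened by a factor of `matched_width_multiplier(8, 2)`, which is 2.0, to occupy the same storage as an 8-bit network of the original width.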
